Control Node For Multi-Core System

ABSTRACT

A computing system with a plurality of nodes is disclosed. At least one of the plurality of nodes includes an execution unit configured to execute an operation. An interconnection network is coupled to the plurality of nodes. The interconnection network is configured to provide interconnections among the plurality of nodes. A control node is coupled to the plurality of nodes via the network to manage the execution of the operation by the one or more of the plurality of nodes.

CLAIM OF PRIORITY

This application is a continuation application of U.S. application Ser. No. 14/331,741 filed Jul. 15, 2014, which is a continuation of U.S. application Ser. No. 13/493,216 filed on Jun. 11, 2012, now U.S. Pat. No. 8,782,196, which is a continuation of U.S. application Ser. No. 12/367,690 filed on Feb. 9, 2009, now U.S. Pat. No. 8,200,799, which is a continuation of U.S. application Ser. No. 10/443,501 filed on May 21, 2003, now U.S. Pat. No. 7,653,710, which claims priority from U.S. Provisional Patent Application No. 60/391,874, filed on Jun. 25, 2002, entitled “DIGITAL PROCESSING ARCHITECTURE FOR AN ADAPTIVE COMPUTING MACHINE”; the disclosures of which are hereby incorporated by reference as if set forth in full in this document for all purposes.

CROSS-REFERENCE TO RELATED APPLICATIONS

This application is related to U.S. patent application Ser. No. 09/815,122, filed on Mar. 22, 2001, entitled “ADAPTIVE INTEGRATED CIRCUITRY WITH HETEROGENEOUS AND RECONFIGURABLE MATRICES OF DIVERSE AND ADAPTIVE COMPUTATIONAL UNITS HAVING FIXED, APPLICATION SPECIFIC COMPUTATIONAL ELEMENTS”; U.S. patent application Ser. No. 10/443,596, filed on May 21, 2003, entitled “PROCESSING ARCHITECTURE FOR A RECONFIGURABLE ARITHMETIC NODE IN AN ADAPTIVE COMPUTING SYSTEM” (Attorney Docket 21202-002910US); and U.S. patent application Ser. No. 10/443,554, filed on May 21, 2003, entitled “UNIFORM INTERFACE FOR A FUNCTIONAL NODE IN AN ADAPTIVE COMPUTING ENGINE” (Attorney Docket 21202-003400US).

BACKGROUND

This invention relates in general to digital data processing and more specifically to an interconnection facility for transferring digital information among components in an adaptive computing architecture.

A common limitation to processing performance in a digital system is the efficiency and speed of transferring instructions, data and other information among different components and subsystems within the digital system. For example, the bus speed in a general-purpose Von Neumann architecture dictates how fast data can be transferred between the processor and memory and, as a result, places a limit on the computing performance (e.g., millions of instructions per second (MIPS), floating-point operations per second (FLOPS), etc.).

Other types of computer architecture design, such as multi-processor or parallel processor designs, require complex communication, or interconnection, capabilities so that each of the different processors can communicate with other processors, with multiple memory devices, input/output (I/O) ports, etc. With today's complex processor system designs, the importance of an efficient and fast interconnection facility rises dramatically. However, such facilities are difficult to design in a way that optimizes the goals of speed, flexibility and simplicity of design.

SUMMARY

A hardware task manager indicates when input and output buffer resources are sufficient to allow a task to execute. The task can require an arbitrary number of input values from one or more other (or the same) tasks. Likewise, a number of output buffers must also be available before the task can start to execute and store results in the output buffers.

The hardware task manager maintains a counter in association with each input and output buffer. For input buffers, a negative value for the counter means that there is no data in the buffer and, hence, the respective input buffer is not ready or available. Thus, the associated task cannot run. Predetermined numbers of bytes, or “units,” are stored into the input buffer and an associated counter is incremented. When the counter value transitions from a negative value to zero, the high-order bit of the counter is cleared, thereby indicating that the input buffer has sufficient data and is available to be processed by a task.

Analogously, a counter is maintained in association with each output buffer. A negative value for an output buffer means that the output buffer is available to receive data. When the high-order bit of an output buffer counter is set, data can be written to the associated output buffer and the task can run.

Ports counters are used to aggregate buffer counter indications by tracking the high-order bit transitions of the counters. For example, if a task needs 10 input buffers and 20 output buffers, then an input ports counter is initialized and maintained by tracking availability of the 10 allocated input buffers and 20 output buffers using simple increments and decrements according to high-order transitions of the buffer counter bits. When the high-order bit (i.e., the sign bit) of the ports counter transitions from a 1 to a 0, the associated task is ready to run.

In one embodiment the invention provides an apparatus for coordinating buffer use among tasks in a processing system, wherein the processing system includes a plurality of hardware nodes, wherein a task is executed on one or more of the hardware nodes, wherein a consuming task uses input buffers to obtain data and wherein a producing task uses output buffers to provide data, the apparatus comprising a task manager for indicating the status of the buffers, the task manager including an output buffer available indicator associated with an output buffer; an input buffer available indicator associated with an input buffer; and a status indicator for indicating that a task is ready to run based on a combination of the output buffer available indicator and the input buffer available indicator.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 illustrates the interface between heterogeneous nodes and the homogeneous network in the ACE architecture;

FIG. 2 illustrates basic components of a hardware task manager;

FIG. 3 shows buffers associated with ports;

FIG. 4 shows buffer size encoding;

FIG. 5 shows a look-up table format;

FIG. 6 shows counter operations;

FIG. 7 shows a table format for task state information;

FIG. 8 illustrates a data format for a node control register;

FIG. 9 shows the layout for a node status register;

FIG. 10 shows the layout for a Port/Memory Translation Table;

FIG. 11 shows a layout for a State Information Table;

FIG. 12 shows a summary of state transitions for a task;

FIG. 13 shows a layout for the Module Parameter List and Module Pointer Table;

FIG. 14 shows an example of packing eight parameters associated with task buffers;

FIG. 15 shows data formats for the Forward and Backward Acknowledgement Messages; and

FIG. 16 shows an overview of an adaptable computing engine architecture.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENT

A detailed description of an ACE architecture used in a preferred embodiment is provided in the patents referenced above. The following section provides a summary of the ACE architecture described in the referenced patents.

Adaptive Computing Engine

FIG. 16 is a block diagram illustrating an exemplary embodiment in accordance with the present invention. Apparatus 100, referred to herein as an adaptive computing engine (ACE) 100, is preferably embodied as an integrated circuit, or as a portion of an integrated circuit having other, additional components. In the exemplary embodiment, and as discussed in greater detail below, the ACE 100 includes one or more reconfigurable matrices (or nodes) 150, such as matrices 150A through 150N as illustrated, and a matrix interconnection network 110. Also in the exemplary embodiment, and as discussed in detail below, one or more of the matrices 150, such as matrices 150A and 150B, are configured for functionality as a controller 120, while other matrices, such as matrices 150C and 150D, are configured for functionality as a memory 140. The various matrices 150 and matrix interconnection network 110 may also be implemented together as fractal subunits, which may be scaled from a few nodes to thousands of nodes.

In a preferred embodiment, the ACE 100 does not utilize traditional (and typically separate) data, DMA, random access, configuration and instruction busses for signaling and other transmission between and among the reconfigurable matrices 150, the controller 120, and the memory 140, or for other input/output (“I/O”) functionality. Rather, data, control and configuration information are transmitted between and among these matrix 150 elements, utilizing the matrix interconnection network 110, which may be configured and reconfigured, in real-time, to provide any given connection between and among the reconfigurable matrices 150, including those matrices 150 configured as the controller 120 and the memory 140.

The matrices 150 configured to function as memory 140 may be implemented in any desired or exemplary way, utilizing computational elements (discussed below) of fixed memory elements, and may be included within the ACE 100 or incorporated within another IC or portion of an IC. In the exemplary embodiment, the memory 140 is included within the ACE 100, and preferably is comprised of computational elements which are low power consumption random access memory (RAM), but also may be comprised of computational elements of any other form of memory, such as flash, DRAM, SRAM, MRAM, ROM, EPROM or E²PROM. In the exemplary embodiment, the memory 140 preferably includes direct memory access (DMA) engines, not separately illustrated.

The controller 120 is preferably implemented, using matrices 150A and 150B configured as adaptive finite state machines (FSMs), as a reduced instruction set (“RISC”) processor, controller or other device or IC capable of performing the two types of functionality discussed below. (Alternatively, these functions may be implemented utilizing a conventional RISC or other processor.) The first control functionality, referred to as “kernel” control, is illustrated as kernel controller (“KARC”) of matrix 150A, and the second control functionality, referred to as “matrix” control, is illustrated as matrix controller (“MARC”) of matrix 150B. The kernel and matrix control functions of the controller 120 are explained in greater detail below, with reference to the configurability and reconfigurability of the various matrices 150, and with reference to the exemplary form of combined data, configuration and control information referred to herein as a “silverware” module.

The matrix interconnection network 110 of FIG. 16 includes subset interconnection networks (not shown). These can include a Boolean interconnection network, data interconnection network, and other networks or interconnection schemes collectively and generally referred to herein as “interconnect”, “interconnection(s)” or “interconnection network(s),” or “networks,” and may be implemented generally as known in the art, such as utilizing FPGA interconnection networks or switching fabrics, albeit in a considerably more varied fashion. In the exemplary embodiment, the various interconnection networks are implemented as described, for example, in U.S. Pat. No. 5,218,240, U.S. Pat. No. 5,336,950, U.S. Pat. No. 5,245,227, and U.S. Pat. No. 5,144,166, and also as discussed below and as illustrated with reference to FIGS. 7, 8 and 9. These various interconnection networks provide selectable (or switchable) connections between and among the controller 120, the memory 140, the various matrices 150, and the computational units (or “nodes”) and computational elements, providing the physical basis for the configuration and reconfiguration referred to herein, in response to and under the control of configuration signaling generally referred to herein as “configuration information”. In addition, the various interconnection networks (110, 210, 240 and 220) provide selectable or switchable data, input, output, control and configuration paths, between and among the controller 120, the memory 140, the various matrices 150, and the computational units, components and elements, in lieu of any form of traditional or separate input/output busses, data busses, DMA, RAM, configuration and instruction busses.

It should be pointed out, however, that while any given switching or selecting operation of, or within, the various interconnection networks may be implemented as known in the art, the design and layout of the various interconnection networks, in accordance with the present invention, are new and novel, as discussed in greater detail below. For example, varying levels of interconnection are provided to correspond to the varying levels of the matrices, computational units, and elements. At the matrix 150 level, in comparison with the prior art FPGA interconnect, the matrix interconnection network 110 is considerably more limited and less “rich”, with lesser connection capability in a given area, to reduce capacitance and increase speed of operation. Within a particular matrix or computational unit, however, the interconnection network may be considerably more dense and rich, to provide greater adaptation and reconfiguration capability within a narrow or close locality of reference.

The various matrices or nodes 150 are reconfigurable and heterogeneous, namely, in general, and depending upon the desired configuration: reconfigurable matrix 150A is generally different from reconfigurable matrices 150B through 150N; reconfigurable matrix 150B is generally different from reconfigurable matrices 150A and 150C through 150N; reconfigurable matrix 150C is generally different from reconfigurable matrices 150A, 150B and 150D through 150N, and so on. The various reconfigurable matrices 150 each generally contain a different or varied mix of adaptive and reconfigurable nodes, or computational units; the nodes, in turn, generally contain a different or varied mix of fixed, application specific computational components and elements that may be adaptively connected, configured and reconfigured in various ways to perform varied functions, through the various interconnection networks. In addition to varied internal configurations and reconfigurations, the various matrices 150 may be connected, configured and reconfigured at a higher level, with respect to each of the other matrices 150, through the matrix interconnection network 110. Details of the ACE architecture can be found in the related patent applications, referenced above.

Hardware Task Manager

FIG. 1 illustrates the interface between heterogeneous nodes and the homogeneous network in the ACE architecture. This interface is referred to as a “node wrapper” since it is used to provide a common input and output mechanism for each node. A node's execution units and memory are interfaced with the network and with control software via the node wrapper to provide a uniform, consistent system-level programming model. Details of the node wrapper can be found in the related patent applications referenced above.

In a preferred embodiment, each node wrapper includes a hardware task manager (HTM) 200. Node wrappers also include data distributor 202, optional direct memory access (DMA) engine 204 and data aggregator 206. The HTM coordinates execution, or use, of node processors and resources, respectively. The HTM does this by processing a task list and producing a ready-to-run queue. The HTM is configured and controlled by a specialized node referred to as a K-node or control node (not shown). However, other embodiments can use other HTM control approaches.

A task is an instance of a module, or group of instructions. A module can be any definition of processing, functionality or resource access to be provided by one or more nodes. A task is associated with a specific module on a specific node. A task is defined by designating resources such as “physical” memory, “logical” input and output buffers, and “logical” input and output ports of the module, and by initializing configuration parameters for the task. A task has four states: Suspend, Idle, Ready, Run.

A task is created by the K-node writing to control registers in the node where the task is being created, and by the K-node writing to control registers in other nodes, if any, that will be producing data for the task and/or consuming data from the task. These registers are memory mapped into the K-node's address space, and “peek and poke” network services are used to read and write these values.

A newly created task starts in the “suspend” state. Once a task is configured, the K-node can issue a “go” command, setting a bit in a control register. The action of this command is to move the task from the “suspend” state to the “idle” state.

When the task is “idle” and all its input buffers and output buffers are available, the task is ADDed to the ready-to-run queue, which is implemented as a FIFO, and the task state is changed to “ready/run”.

Note: Buffers are available to the task when subsequent task execution will not consume more data than is present in its input buffer(s) or will not produce more data than there is capacity in its output buffer(s).

When the execution unit is not busy and the FIFO is not empty, the task number for the next task that is ready to execute is REMOVEd from the FIFO, and the state of this task becomes “run”. In the “run” state, the task consumes data from its input buffers and produces data for its output buffers. For PDU, RAU and RBU unit types, only one task can be in the “run” state at a time, and the current task cannot be preempted. These restrictions are imposed to simplify hardware and software control.

When the task completes processing:

    1) if the task's GO bit is zero, its state will be set to SUSPEND; or
    2) if (its GO bit is one) AND (its PORTS_COUNTER msb is one), its state will be set to idle; or
    3) if (its GO bit is one) AND (the FIFO is not empty) AND (its PORTS_COUNTER msb is zero), the task will be ADDed to the ready-to-run queue and its state will be “ready”; or
    4) if (its GO bit is one) AND (the FIFO is empty) AND (its PORTS_COUNTER msb is zero), its state will remain “run”; the task will execute again since its status is favorable and there is no other task waiting to run.
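The four completion rules can be read as a small decision function. The following Python sketch is illustrative only: the function name and the string state labels are assumptions, not part of the hardware description.

```python
def state_after_completion(go_bit: int, ports_counter_msb: int, fifo_empty: bool) -> str:
    """Return the state a task enters when it finishes a run (hypothetical helper)."""
    if go_bit == 0:
        return "suspend"   # rule 1: the K-node has stopped the task
    if ports_counter_msb == 1:
        return "idle"      # rule 2: at least one port is not available
    if not fifo_empty:
        return "ready"     # rule 3: re-queued behind tasks already waiting
    return "run"           # rule 4: nothing else is waiting, so run again
```

The rules are mutually exclusive, so the order of the checks only matters for readability: the GO bit is tested first because it overrides every other condition.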

The K-node can clear the task's GO bit at any time. When the task reaches the “idle” state and its GO bit is zero, its state will transition to “suspend”.

The K-node can determine if a task is hung in a loop by setting and testing status. When the K-node wishes to stop a run-away task, it should clear the task's GO bit and issue the “abort” command to reset the task's control unit. After reset, the task's state will transition to “idle”. And, if its GO bit has been cleared, its state will transition to “suspend”.

Task Lists

A node has a task list, and each task is identified by its “task number”. Associated with each task are the following:

-   Task_number [4:0]: the task number, in the range of 0 to 31.
-   State [1:0], with values:
    -   ‘00’ = suspended
    -   ‘01’ = idle
    -   ‘10’ = ready
    -   ‘11’ = run
-   Go_bit, with values:
    -   0 = stop
    -   1 = go
-   Module: pointer to the module used to implement this task. For reconfigurable hardware modules, this may be a number that corresponds to a specific module. For the PDU, this is the instruction memory address where the module begins.
-   Ports_counter: the negative number of input ports and output ports that must be available before the task state can transition from “idle” to “ready”. For example, an initial value of −3 might indicate that two input ports and one output port must be available before the task state changes to “ready”. When a port changes from “unavailable” to “available”, Ports_counter is incremented by one. When a port changes from “available” to “unavailable”, Ports_counter is decremented by one. When the value for Ports_counter reaches (or remains) zero and the task state is “idle”, task state transitions to “ready”. The sign (high-order) bit of this counter reflects the status of all input ports and output ports for this task. When it is set, not all ports are available; and when it is clear, then all ports are available, and task state transitions from “idle” to “ready”.
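The Ports_counter behavior described in the list can be sketched in a few lines of Python. The class and method names are hypothetical, and the counter is modeled as an unbounded signed integer because the text does not state its width.

```python
class PortsCounter:
    """Illustrative model of Ports_counter (names are assumptions)."""

    def __init__(self, ports_required: int):
        # Initialized to the negative number of ports that must be available.
        self.value = -ports_required

    def port_became_available(self) -> None:
        self.value += 1

    def port_became_unavailable(self) -> None:
        self.value -= 1

    @property
    def sign_bit(self) -> int:
        # Set (1) while the count is negative, clear (0) otherwise.
        return 1 if self.value < 0 else 0

    @property
    def all_ports_available(self) -> bool:
        return self.sign_bit == 0


# Two input ports and one output port must be available: start at -3.
counter = PortsCounter(3)
for _ in range(3):
    counter.port_became_available()
print(counter.value, counter.all_ports_available)
```

Only the sign bit has to be examined to decide readiness, which is why a single counter can stand in for the status of every port of the task.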

Each task can have up to four input buffers. Associated with each input buffer are the following:

-   In_port_number(0,1,2,3) [4:0]: a number in the range of 0 to 31.
-   Mem_phys_addr [k:0]: the physical address in memory of the input buffer.
-   Size [3:0]: a power-of-two coding for the size of the input buffer.
-   Consumer_count [15:0]: a two's complement count, with a range of −32768 to +32767, for input buffer status. It is initialized by the K-node, incremented by an amount Fwdackval by the upstream producer and incremented by an amount Negbwdackval by the consumer (this task). The sign (high-order) bit of this counter indicates input buffer status. When it is set (negative), the buffer is unavailable to this task; and when it is clear (non-negative), the buffer is available to this task.
-   Bwdackval [15:0]: the negative backward acknowledge value, with a range of −32768 to 0.
-   Producer_task_number [4:0]: a number in the range of 0 to 31 indicating the producer's task number for counter maintenance, including backward acknowledgement messages to remote producers.
-   Producer_outport_number [4:0]: a number in the range of 0 to 31 indicating the producer's output port number for counter maintenance, including backward acknowledgement messages to remote producers.
-   Producer_node_number [6:0]: a number in the range of 0 to 127 indicating a remote producer's node number for routing backward acknowledgement messages to remote producers.

Each task can have up to four output buffers. Associated with each output buffer are the following:

-   Out_port_number(0,1,2,3) [4:0]: a number in the range of 0 to 31.
-   Mem_phys_addr [k:0]: the physical address in memory of the output buffer, if local.
-   Size [3:0]: a power-of-two coding for the size of the output buffer, if local.
-   Producer_count [15:0]: a two's complement count, with a range of −32768 to +32767, for output buffer status. It is initialized by the K-node, incremented by an amount Fwdackval by the producer (this task) and incremented by an amount Negbwdackval by the downstream consumer. The sign (high-order) bit of this counter indicates output buffer status. When it is set (negative), the buffer is available to this task; and when it is clear (non-negative), the buffer is unavailable to this task.
-   Fwdackval [15:0]: the forward acknowledge value, with a range of 0 to +32767.
-   Consumer_task_number [4:0]: a number in the range of 0 to 31 indicating the consumer's task number for counter maintenance, including forward acknowledgement messages to remote consumers.
-   Consumer_in_port_number [4:0]: a number in the range of 0 to 31 indicating the consumer's input port number for counter maintenance, including forward acknowledgement messages to remote consumers.
-   Consumer_node_number [6:0]: a number in the range of 0 to 127 indicating a remote consumer's node number for routing data and forward acknowledgement messages to remote consumers.
-   Parms_pointer [k:0]: the physical address in memory indicating the first of tbd entries containing the task's configuration parameters.

A preferred embodiment of the invention uses node task lists. Each list can designate up to 32 tasks. Each of the up to 32 tasks can have up to four input ports (read ports) and up to four output ports (write ports). A node can have 32 input ports and 32 output ports. 5-bit numbers are used to identify each port. Each number is associated with a 20-bit address in the contiguous address space for 1024 kilobytes of physical memory.

HTM Components

FIG. 2 illustrates basic components of an HTM. These include port-to-address translation table 220, ACKs processor 222, ready-to-run queue 224, state information 226, parameters pointers 228, and parameters memory 230.

Port-to-Address Translation Table

Under K-node control, the execution units in each node can write into any memory location in the 20-bit contiguous address space. Accessing permissions are controlled by the port number-to-physical address translation tables. There are 32 entries in the table to support up to 32 ports at each node's input.

Each of the 32 ports at each node's input can be assigned to an output port of any task executing on any node (including “this node”) on the die. Each port number is associated with a “power-of-2” sized buffer within one or more of the node's physical memory blocks, as shown in FIG. 3.

The 20-bit contiguous address space is accessible by a 6-bit node number (the six high order bits) and a 14-bit (low order bits) byte address for the 16 KBytes within a tile.
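As a rough illustration of this split, the Python sketch below separates a 20-bit address into its node number and byte address. The function and constant names are hypothetical; only the 6-bit and 14-bit field widths come from the text.

```python
NODE_BITS = 6    # high-order bits: node number
BYTE_BITS = 14   # low-order bits: byte address within a 16 KByte tile


def split_address(addr: int) -> tuple[int, int]:
    """Split a 20-bit address into (node_number, byte_address)."""
    if not 0 <= addr < (1 << (NODE_BITS + BYTE_BITS)):
        raise ValueError("address does not fit in 20 bits")
    return addr >> BYTE_BITS, addr & ((1 << BYTE_BITS) - 1)


def join_address(node_number: int, byte_address: int) -> int:
    """Inverse of split_address."""
    return (node_number << BYTE_BITS) | byte_address
```

With these widths, 64 nodes of 16 KBytes each cover the 1024 kilobytes of contiguous address space.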

Because network transfers are 32-bit transfers, 18-bit longword addresses (the 20-bit byte address without its two low order bits) are stored in the translation tables, and the two lower order address bits are inferred (and set to ‘00’ by each memory's address mux). The power-of-two buffer size is encoded in a four-bit value for each entry in the table, as shown in FIG. 4.

The translation table is loaded/updated by the K-node. When a task writes to this node, its output port number is used to access the table. Its accompanying data is written into the current address [ADDR] that is stored in the table, and the next address [NXTADDR] is calculated as follows:

    BASE = SIZE * INT{ADDR / SIZE}
    OFFSET = ADDR - BASE
    NXTOFFSET = (OFFSET + 1) mod SIZE
    NXTADDR = BASE + NXTOFFSET
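The next-address calculation amounts to a circular increment inside a SIZE-aligned buffer. The Python sketch below assumes that the quantity being incremented is the current offset within the buffer and that addresses advance by one unit per write; the function name is hypothetical.

```python
def next_address(addr: int, size: int) -> int:
    """Advance addr by one, wrapping at the end of its size-aligned buffer."""
    base = size * (addr // size)        # BASE = SIZE * INT{ADDR / SIZE}
    offset = addr - base                # position inside the buffer
    next_offset = (offset + 1) % size   # wrap back to the buffer start
    return base + next_offset
```

Because the buffer size is a power of two, the same result can be obtained with bit masks, for example `(addr & ~(size - 1)) | ((addr + 1) & (size - 1))`, which is likely closer to what hardware would do.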

ACKs Processor

Tasks communicate through buffers. Buffers are accessed via port numbers. Each active buffer is associated with a producer task and a consumer task. Each task maintains a count reflecting the amount of data in the buffer. As the producer writes data into the buffer, it updates its Producer_count with a value, Fwdackval, equal to the number of bytes that it has produced (written). It also updates the corresponding Consumer_count, using a FWDACK message if the consumer is remote (not in its node).

When the consumer reads, and no longer requires access to, data in the buffer, it updates its Consumer_count with a value, Bwdackval, equal to minus the number of bytes that it has consumed. It also updates the corresponding Producer_count, using a BWDACK message if the producer is remote.

Note: Data formats for the Forward and Backward Acknowledgement Messages are shown in FIG. 15.

The ACKs processor includes a 64-entry by 16-bit LUT to store counts for each of its (up to) 32 input ports and 32 output ports. The format for this LUT is shown in FIG. 5.

The counters are initialized with negative values by the K-node. Producer counters are accessed by their associated output port numbers; consumer counters are accessed by their associated input port numbers.

Producer counters are incremented by Fwdackvals from their associated tasks, and they are incremented by Bwdackvals from the downstream tasks that consume the data. Consumer counters are incremented by Bwdackvals from their associated tasks, and they are incremented by Fwdackvals from the upstream tasks that produce the data.

Note that incrementing by a Bwdackval, a negative value, is equivalent to decrementing by a positive value, producing a more negative result.

These operations are summarized in FIG. 6.

In FIG. 6, an upstream task is the producer (writer) of Buffer A. One of the upstream task's output port numbers is associated with Buffer A and its producer counter. The producer counter is incremented by the upstream task's Fwdackval, and it is incremented by this task's Bwdackval.

In FIG. 6, this task is the consumer (reader) of Buffer A. One of this task's input port numbers is associated with Buffer A and its consumer counter. The consumer counter is incremented by the upstream task's Fwdackval, and it is incremented by this task's Bwdackval.

In FIG. 6, this task is the producer (writer) of Buffer B. One of this task's output port numbers is associated with Buffer B and its producer counter. The producer counter is incremented by this task's Fwdackval, and it is incremented by the downstream task's Bwdackval.

In FIG. 6, a downstream task is the consumer (reader) of Buffer B. One of the downstream task's input port numbers is associated with Buffer B and its consumer counter. The consumer counter is incremented by this task's Fwdackval, and it is incremented by the downstream task's Bwdackval.

An input buffer is available to its associated task when the high order bit of its consumer counter is clear, indicating a non-negative count. An input buffer is not available to its associated task when the bit is set, indicating a negative count. Each consumer counter is initialized (by the K-node) with the negative number of bytes that must be in its input buffer before the associated task can execute. When the high order bit is clear, indicating buffer availability, the task is assured that the data it will consume during its execution is in the buffer.

An output buffer is available to its associated task when the high order bit of its producer counter is set, indicating a negative count. An output buffer is not available to its associated task when the bit is clear, indicating a non-negative count. Each producer counter is initialized (by the K-node) with the negative number of bytes that its task can produce before it must suspend execution. An available output buffer indication assures the task that there is sufficient buffer capacity for execution with no possibility of overflow.
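The interplay of the two counters for a single buffer can be traced with a small Python model. It is a sketch under stated assumptions: the class and method names are hypothetical, both tasks are treated as local (no FWDACK or BWDACK messages), counters are unbounded integers rather than 16-bit registers, and the initial values of −8 and −4 are chosen only for illustration.

```python
class BufferCounters:
    """Producer and consumer counters for one buffer (illustrative model)."""

    def __init__(self, producer_init: int, consumer_init: int,
                 fwdackval: int, bwdackval: int):
        self.producer_count = producer_init   # negative means output available
        self.consumer_count = consumer_init   # non-negative means input available
        self.fwdackval = fwdackval            # bytes written per producer run
        self.bwdackval = bwdackval            # minus bytes read per consumer run

    def producer_ran(self) -> None:
        # The producer's Fwdackval is added to both counters.
        self.producer_count += self.fwdackval
        self.consumer_count += self.fwdackval

    def consumer_ran(self) -> None:
        # The consumer's (negative) Bwdackval is added to both counters.
        self.consumer_count += self.bwdackval
        self.producer_count += self.bwdackval

    @property
    def output_available(self) -> bool:
        return self.producer_count < 0

    @property
    def input_available(self) -> bool:
        return self.consumer_count >= 0


buf = BufferCounters(producer_init=-8, consumer_init=-4, fwdackval=4, bwdackval=-4)
buf.producer_ran()   # 4 bytes written: the consumer can now run
buf.producer_ran()   # 8 bytes outstanding: the producer must wait
buf.consumer_ran()   # 4 bytes freed: the producer may write again
```

In this trace the producer is blocked exactly when 8 unconsumed bytes are outstanding, and the consumer is blocked whenever fewer than 4 bytes are present.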

The initial values for these counters are functions of Ackvals and the desired numbers of task execution iterations after initialization.

To avoid deadlocks, the minimum buffer size must be the next power of two that exceeds the sum of the maximum absolute values of Fwdackvals and Bwdackvals. For example, for Fwdackval=51 and Bwdackval=−80, the sum is 131, so the buffer size must be greater than, or equal to, 256.
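The sizing rule can be checked with a short Python helper. The function name is hypothetical, and the sketch reads “exceeds” as strictly greater than the sum, which is an interpretation of the text.

```python
def min_buffer_size(max_fwdackval: int, max_bwdackval: int) -> int:
    """Smallest power of two strictly greater than |Fwdackval| + |Bwdackval|."""
    total = abs(max_fwdackval) + abs(max_bwdackval)
    size = 1
    while size <= total:
        size *= 2
    return size


# The example from the text: 51 + 80 = 131, and the next power of two is 256.
print(min_buffer_size(51, -80))
```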

Counters are updated when ACKVAL messages arrive from the network and from locally executing tasks. When the high order bits of the current count and the updated count are different, a change of status indication is generated along with the associated task number, so that its STATE Ports_counter can be incremented or decremented. For input ports, the ports_counter is decremented for 0-to-1 transitions, and it is incremented for 1-to-0 transitions. For output ports, the ports_counter is incremented for 0-to-1 transitions, and it is decremented for 1-to-0 transitions.
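The transition rules can be expressed as a function that compares the sign bits of the old and new 16-bit counts and returns the adjustment to apply to the Ports_counter. The helper names in this Python sketch are assumptions.

```python
def sign_bit16(count: int) -> int:
    """High-order bit of a count stored as a 16-bit two's complement value."""
    return ((count & 0xFFFF) >> 15) & 1


def ports_counter_delta(old_count: int, new_count: int, is_input_port: bool) -> int:
    """Return +1, -1 or 0 for the Ports_counter after a buffer counter update."""
    old_bit, new_bit = sign_bit16(old_count), sign_bit16(new_count)
    if old_bit == new_bit:
        return 0                        # no change of status indication
    bit_became_set = new_bit == 1       # a 0-to-1 transition
    if is_input_port:
        # Input buffers are available when the bit is clear.
        return -1 if bit_became_set else +1
    # Output buffers are available when the bit is set.
    return +1 if bit_became_set else -1
```

In both cases the Ports_counter moves up when a port becomes available and down when it becomes unavailable; the opposite bit polarities for input and output ports are what make the two rules look mirrored.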

When the high order bit of the Ports_counter transitions from 1 to 0, the associated task is ready to run, and it is ADDed to the Ready-to-Run Queue. Also, when the current task completes and its ACKs have been processed, if its GO bit is zero, its STATE is set to SUSPEND. Else, if its Ports_counter msb is clear, it is ready to run again; and, if the FIFO is empty, it runs again; or, if the FIFO is not empty, it is ADDed to the queue. Finally, if its GO bit is one, but its Ports_counter msb is set, its STATE is set to IDLE; and it must wait for the next Ports_counter msb transition from 1 to 0 before it is once again ready to run and ADDed to the queue.

Ready-to-Run Queue

The Ready-to-Run Queue is a 32-entry by 5 bits per entry FIFO that stores the task numbers of all tasks that are ready to run. The K-node initializes the FIFO by setting its 5-bit write pointer (WP) and its 5-bit read pointer (RP) to zero. Initialization also sets the FIFO status indication: EMPTY=1.

When a task is ready to run, its task number is ADDed to the queue at the location indicated by WP, and WP is incremented. For every ADD, EMPTY is set to 0.

When the execution unit is idle and the FIFO is not empty (EMPTY=0), the task number for the next task to be executed is REMOVEd from the queue at the location indicated by RP. When the task is completed, RP is incremented. And, if RP=WP, EMPTY is set to 1.

The FIFO is FULL when [(RP=WP) AND (EMPTY=0)].
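The pointer and flag behavior described for the Ready-to-Run Queue can be modeled in software as below. This is a minimal sketch, not the hardware design: the class and method names are hypothetical, and raising an exception on ADD-when-full or REMOVE-when-empty is an assumption, since the text does not say how those cases are handled.

```python
class ReadyToRunQueue:
    """Software model of the 32-entry by 5-bit Ready-to-Run FIFO."""

    DEPTH = 32

    def __init__(self):
        self.entries = [0] * self.DEPTH
        self.wp = 0          # 5-bit write pointer
        self.rp = 0          # 5-bit read pointer
        self.empty = True    # EMPTY=1 after initialization

    @property
    def full(self) -> bool:
        # FULL when (RP = WP) AND (EMPTY = 0)
        return self.rp == self.wp and not self.empty

    def add(self, task_number: int) -> None:
        if self.full:
            raise OverflowError("queue full")       # assumed behavior
        self.entries[self.wp] = task_number & 0x1F
        self.wp = (self.wp + 1) % self.DEPTH
        self.empty = False

    def remove(self) -> int:
        """Return the next task number; RP advances only on completion."""
        if self.empty:
            raise IndexError("queue empty")         # assumed behavior
        return self.entries[self.rp]

    def complete(self) -> None:
        """Called when the removed task completes."""
        self.rp = (self.rp + 1) % self.DEPTH
        if self.rp == self.wp:
            self.empty = True
```

Keeping RP unchanged until the task completes mirrors the text, where REMOVE reads the entry at RP and the increment happens at completion.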

State Information Table

State information for each of (up to) 32 tasks is maintained in a 32-entry by 6 bit table that is accessed by one of 32 task numbers. The format for this table is shown in FIG. 7.

The State Information Table is initialized by the K-node (POKE). The K-node also can monitor the state of any task (PEEK). In addition to the K-node's unlimited access to the table, other accesses to it are controlled by a FSM that receives inputs from the ACKs Processor, the Ready-to-Run Queue, and the Execution Unit as shown in FIG. 8. Details of this FSM are beyond the scope of this paper.

Parms Pointers

Associated with each task is a register that contains the physical address where the first of the task's configuration parameters is stored in a contiguous chunk of memory.

Parms Memory

Each task's configuration parameters, or Module Parameter List (MPL), are stored in a contiguous chunk of memory referenced by the task's Parms Pointer. The numbers of parameters and their purposes will vary from one task to another. As tasks are designed, their specific requirements for configuration parameters will be determined and documented.

Typically, these requirements will include:

Module—Pointer to the module used to implement this task. For reconfigurable hardware modules, this may be a number that corresponds to a specific module. For the PDU, this is the instruction memory address where the module begins.

For each of up to four buffers from which the task will consume (read) data:

-   Memory Physical Address
-   Buffer Size
-   Input Port Number
-   Producer Task Number
-   Producer Output Port Number
-   Producer Node Number (if remote)
-   Producer (Local/Remote); boolean
-   Bwdackval

For each of up to four buffers into which the task will produce (write) data:

-   Memory Physical Address (if local)
-   Buffer Size (if local)
-   Output Port Number
-   Consumer Task Number
-   Consumer Input Port Number
-   Consumer Node Number (if remote)
-   Consumer (Local/Remote); boolean
-   Fwdackval

For each presettable counter (for example: number of iterations count; watchdog count):

-   (Counter Modulus-1)
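The parameter groups listed above can be pictured as a simple record structure. This is only an illustrative sketch: the class and field names are hypothetical, field widths and memory packing are not modeled, and the use of `None` for items marked "if local" or "if remote" is an assumption.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class InputBufferParms:
    memory_physical_address: int
    buffer_size: int
    input_port_number: int
    producer_task_number: int
    producer_output_port_number: int
    producer_is_remote: bool
    bwdackval: int
    producer_node_number: Optional[int] = None      # only if remote


@dataclass
class OutputBufferParms:
    output_port_number: int
    consumer_task_number: int
    consumer_input_port_number: int
    consumer_is_remote: bool
    fwdackval: int
    memory_physical_address: Optional[int] = None   # only if local
    buffer_size: Optional[int] = None               # only if local
    consumer_node_number: Optional[int] = None      # only if remote


@dataclass
class ModuleParameterList:
    module: int                                     # pointer to the module
    inputs: List[InputBufferParms] = field(default_factory=list)
    outputs: List[OutputBufferParms] = field(default_factory=list)
    counter_moduli_minus_1: List[int] = field(default_factory=list)

    def __post_init__(self):
        # The text allows up to four buffers in each direction.
        if len(self.inputs) > 4 or len(self.outputs) > 4:
            raise ValueError("at most four input and four output buffers")
```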

Node Control Register (NCR)

The layout for the Node Control Register is shown in FIG. 8.

ENB—Bit 15—When the NCR Enable bit is clear, the node ceases all operation, except that it continues to support PEEK and POKE operations. The NCR Enable bit must be set to 1 to enable any other node functions.

ABT—Bit 14—Writing (POKING) the NCR with Bit 14 set to 1 generates an Abort signal to the execution unit, causing it to halt immediately. The state of the aborted task transitions to IDLE; and if its GO bit has been cleared (as it should be prior to issuing the Abort), the state will transition to SUSPEND. This is the K-node's sledge hammer to terminate a runaway task. Writing the NCR with Bit 14=0 is no operation. When reading (PEEKING) NCR, zero will be returned for Bit 14.

RSV—Bit 13—At this time, Bit 13 is unused. When writing the NCR, Bit 13 is don't care, and when reading NCR, zero will be returned for Bit 13.

WPE—Bit 12—Writing the NCR with Bit 12 set to 1 results in the writing of the NCR[9:5] value into Queue Write Pointer (with ENB=0, a diagnostics WRITE/READ/CHECK capability). Writing the NCR with Bit 12=0 is no operation. When reading NCR, zero will be returned for Bit 12.

RPE—Bit 11—Writing the NCR with Bit 11 set to 1 results in the writing of the NCR[4:0] value into Queue Read Pointer (with ENB=0, a diagnostics WRITE/READ/CHECK capability). Writing the NCR with Bit 11=0 is no operation. When reading NCR, zero will be returned for Bit 11.

Queue Initialization

Writing the NCR with Bits 12 and 11 set to 1 and with Bits [9:5] and Bits [4:0] set to zeros initializes the queue, setting the Write Pointer to zero, the Read Pointer to zero, and the Queue Empty Status Flag to 1.

Queue Empty Status Flag—Bit 10—READ ONLY. Bit 10, the Queue Empty Status Flag, is set to 1 when the Ready-to-Run FIFO is empty; it is set to 0 when it is not empty. When Bit 10 is set to 1, the Write Pointer (NCR[9:5]) and Read Pointer (NCR[4:0]) values will be the same. When the pointer values are the same, and Bit 10=0, the FIFO is FULL. When writing NCR, Bit 10 is don't care.

Queue Write Pointer—Bits [9:5]—For diagnostics WRITE/READ/CHECK capability (and for queue initialization), writing NCR with Bit 12=1 results in the writing of the NCR[9:5] value into Queue Write Pointer. When writing NCR with Bit 12=0, Bits [9:5] are don't care. When reading NCR, Bits [9:5] indicate the current Queue Write Pointer value.

Queue Read Pointer—Bits [4:0]—For diagnostics WRITE/READ/CHECK capability (and for queue initialization), writing NCR with Bit 11=1 results in the writing of the NCR[4:0] value into Queue Read Pointer. When writing NCR with Bit 11=0, Bits [4:0] are don't care. When reading NCR, Bits [4:0] indicate the current Queue Read Pointer value.
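The NCR bit assignments above can be exercised with a pair of helper functions for building a write word and decoding a read word. The function names and the dictionary keys are hypothetical; only the bit positions come from the text.

```python
ENB, ABT, WPE, RPE, QEMPTY = 15, 14, 12, 11, 10   # bit positions from the text


def ncr_write_word(enb=False, abort=False, wp=None, rp=None) -> int:
    """Build a 16-bit word to POKE into the NCR.

    Passing wp or rp sets the matching write-enable bit (WPE or RPE)."""
    word = (int(enb) << ENB) | (int(abort) << ABT)
    if wp is not None:
        word |= (1 << WPE) | ((wp & 0x1F) << 5)
    if rp is not None:
        word |= (1 << RPE) | (rp & 0x1F)
    return word


def ncr_decode(word: int) -> dict:
    """Split a word PEEKed from the NCR into its readable fields."""
    fields = {
        "enb": bool((word >> ENB) & 1),
        "queue_empty": bool((word >> QEMPTY) & 1),
        "wp": (word >> 5) & 0x1F,
        "rp": word & 0x1F,
    }
    # Equal pointers with the empty flag clear means the FIFO is FULL.
    fields["queue_full"] = fields["wp"] == fields["rp"] and not fields["queue_empty"]
    return fields


# Queue initialization: Bits 12 and 11 set, both pointer fields zero.
print(hex(ncr_write_word(wp=0, rp=0)))   # 0x1800
```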

Node Status Register (NSR)

The layout for the Node Status Register is shown in FIG. 9. The Node Status Register is a READ ONLY register. READING NSR clears Bits 14 and 13. WRITING NSR is no operation.

ENB—Bit 15—Bit 15, Enable, simply indicates the state of NCR[15]: Enable.

ABT—Bit 14—When an Abort command is issued (WRITE NCR, Bit 14=1), the executing task is suspended, after which the Abort Status Bit 14 is set to 1. Reading NSR clears Bit 14.

TCS—Bit 13—The Task Change Status Bit 13 is set to 1 when an execution unit REMOVEs a TASK # from the Ready-to-Run Queue. Reading NSR clears Bit 13. The K-node can perform a “watch dog” operation by reading NSR, which clears Bit 13, and reading NSR again after a time interval. After the second read, if Bit 13 is set to 1, another REMOVE (initiating execution of the next task) has occurred during the time interval. If Bit 13=0, another REMOVE has not occurred during the time interval.

NRS—Bit 12—This bit is set to 1 when the node is executing a task. When the bit=0, the node is not executing a task.

Reserved—Bits [11:5]—These bits are not assigned at this time, and reading the NSR results in zeros being returned for Bits [11:5].

Current Task Number—Bits [4:0]—Bits [4:0] is the 5-bit number (task number) associated with the task currently executing (if any).
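The read-to-clear behavior of Bits 14 and 13, and the watchdog use of Bit 13, can be modeled as follows. This is a software sketch under assumptions: the class name, attribute names, and the helper `task_changed_since_last_read` are hypothetical.

```python
class NodeStatusRegister:
    """Software model of the read-only NSR with read-to-clear Bits 14 and 13."""

    def __init__(self):
        self.enb = False          # mirrors NCR[15]
        self.abt = False          # Bit 14, set after an Abort
        self.tcs = False          # Bit 13, set on each REMOVE
        self.running = False      # Bit 12
        self.current_task = 0     # Bits [4:0]

    def read(self) -> int:
        word = (
            (int(self.enb) << 15)
            | (int(self.abt) << 14)
            | (int(self.tcs) << 13)
            | (int(self.running) << 12)
            | (self.current_task & 0x1F)
        )
        # Reading NSR clears Bits 14 and 13.
        self.abt = False
        self.tcs = False
        return word


def task_changed_since_last_read(nsr: NodeStatusRegister) -> bool:
    """Watchdog check: True if a REMOVE occurred since the previous read."""
    return bool(nsr.read() & (1 << 13))
```

A K-node style watchdog would call `nsr.read()` once to clear Bit 13, wait for the chosen interval, and then call `task_changed_since_last_read(nsr)`.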

Port/Memory Translation Table (PTT)

The layout for the 32-entry Port/Memory Translation Table (PTT) is shown in FIG. 4.

Producers Counters Table (PCT); Consumers Counters Table (CCT)

The layouts for the 32-entry Producers Counters Table (PCT) and the 32-entry Consumers Counters Table (CCT) are shown in FIG. 5.

Ready-to-Run Queue (RRQ)

The layout for the 32-entry Ready-to-Run Queue (RRQ) is shown in FIG. 10.

Reserved—Bits [15:5]—These bits are not assigned at this time, and reading the RRQ results in zeros being returned for Bits [15:5].

Task Number—Bits [4:0]—The K-node can PEEK/POKE the 32-entry by 5-bit table for diagnostics purposes.

State Information Table (SIT)

The layout for the 32-entry State Information Table (SIT) is shown in FIG. 11. The 32-entry SIT is initialized by the K-node. This includes setting the initial value for the Ports_counter, the STATE_bit to zero, and the GO_bit=0. Thereafter, the K-node activates any of up to 32 tasks by setting its GO_bit=1. The K-node de-activates any of up to 32 tasks by setting its GO_bit=0.

Prior to issuing an ABORT command, the K-node should clear the GO_bit of the task that is being aborted.

Bit 15, the GO_bit, is a READ/WRITE bit.

Bits [12:5] are unassigned at this time. For WRITE operations, they are don't care, and for READ operations, zeros will be returned for these fields.

When the SIT is written with Bit 13 (STATE Write Enable) set to 1, the STATE Bit for the associated task is set to the value indicated by Bit [14]. When Bit 13 is set to zero, there is no operation. For READ operations, the current STATE Bit for the associated task is returned for Bit [14], and a zero is returned for Bit 13.

When the SIT is written with Bit 4 (Ports_counter Write Enable) set to 1, the Ports_counter for the associated task is set to the value indicated by Bits [3:0]. When Bit 4 is set to zero, there is no operation. For READ operations, the current value of Ports_counter for the associated task is returned for Bits [3:0], and a zero is returned for Bit 4.
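The write-enable semantics of a SIT entry can be sketched in software as below. The class name `SitEntry` is hypothetical, and the sketch assumes that the GO_bit is updated on every write because the text describes it as a plain READ/WRITE bit with no separate write enable.

```python
class SitEntry:
    """Software model of one State Information Table entry."""

    def __init__(self):
        self.go = 0              # Bit 15
        self.state = 0           # Bit 14
        self.ports_counter = 0   # Bits [3:0]

    def write(self, word: int) -> None:
        self.go = (word >> 15) & 1          # assumed: written unconditionally
        if (word >> 13) & 1:                # STATE Write Enable
            self.state = (word >> 14) & 1
        if (word >> 4) & 1:                 # Ports_counter Write Enable
            self.ports_counter = word & 0xF

    def read(self) -> int:
        # Bits 13 and 4 (the write enables) and Bits [12:5] read back as zero.
        return (self.go << 15) | (self.state << 14) | self.ports_counter


entry = SitEntry()
# Initialize: GO=0, STATE=0 (Bit 13 set), Ports_counter=0xD (Bit 4 set).
entry.write((1 << 13) | (1 << 4) | 0xD)
# Activate the task by setting GO=1 without touching STATE or Ports_counter.
entry.write(1 << 15)
print(hex(entry.read()))   # 0x800d
```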

State transitions for a task are summarized in the table shown in FIG. 12. Note that for each of the (up to) 32 tasks, the K-node can resolve merged READY/RUN status by comparing any of 32 task numbers with the current task number which is available in the Node Status Register, NSR[4:0].

MPL Pointer Table (MPT)

The layout for the 32-entry Module Parameter List (MPL) Pointer Table (MPT) is shown in FIG. 13. Associated with each task is a register that contains the physical address in a contiguous chunk of memory where the first of the task's tbd configuration parameters is stored.

Because there are unresolved issues associated with aggregating memories/tiles/tasks, we indicate a 16-bit memory pointer (assuming longword address boundaries) which would allow the task to access its configuration information from any memory within its quadrant.

Parms Memory Layouts

Each task's Module Parameter List (MPL) will be stored in a contiguous chunk of memory referenced by its associated Parms Pointer. The numbers of parameters and their purposes will vary from one task to another. As tasks are designed, their specific requirements for configuration parameters (and their associated layouts) will be determined and documented.

An example of packing eight parameters associated with each task buffer is shown in FIG. 14.

Forward/Backward Acknowledgement Message Formats

Data formats for the Forward and Backward Acknowledgement Messages are shown in FIG. 15.

Although the invention has been described with respect to specific embodiments thereof, these embodiments are merely illustrative, and not restrictive of the invention. For example, any type of processing units, functional circuitry or collection of one or more units and/or resources such as memories, I/O elements, etc., can be included in a node. A node can be a simple register, or more complex, such as a digital signal processing system. Other types of networks or interconnection schemes than those described herein can be employed. It is possible that features or aspects of the present invention can be achieved in systems other than an adaptable system, such as described herein with respect to a preferred embodiment.

Thus, the scope of the invention is to be determined solely by the appended claims.

1. A computing device comprising: a plurality of computing nodes; a memory in at least one of the plurality of computing nodes, the plurality of computing nodes configured to make memory requests for access to the memory; an interconnection network operatively coupled to the plurality of computing nodes, the interconnection network providing interconnections among the plurality of computing nodes to route the memory requests; means for identifying a set of memory requests; means for determining when all of the memory requests in the set of memory requests have been performed; and means for initiating execution of a task when all of the memory requests in the set of memory requests have been performed.

2-28. (canceled)